Implementing and training diffusion models step by step
Following the method introduced in the paper, I implemented a single-step denoising UNet and trained it on the MNIST dataset. The training loss curve is shown below:
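For reference, the core training step is only a few lines; here is a minimal sketch assuming PyTorch, a `UNet` class with matching input/output shapes (the `unet` module name is hypothetical), and illustrative choices for the noise level `alpha`, learning rate, and batch size:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from unet import UNet  # hypothetical module holding the UNet described above

device = "cuda" if torch.cuda.is_available() else "cpu"
model = UNet(in_channels=1, out_channels=1).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

loader = DataLoader(
    datasets.MNIST("data", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=256, shuffle=True,
)

alpha = 0.5  # fixed training noise level (illustrative choice)
for x, _ in loader:
    x = x.to(device)
    z = x + alpha * torch.randn_like(x)  # noise the clean digit
    loss = F.mse_loss(model(z), x)       # regress back to the clean image
    opt.zero_grad()
    loss.backward()
    opt.step()
```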
Below is a visualization of the noising process at noise levels α = 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0:
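The noising process itself is just additive Gaussian noise scaled by α, i.e. z = x + α·ε with ε ~ N(0, I); here is a small sketch of how such a grid can be produced (the clamping and plot layout are my own illustrative choices):

```python
import torch
import matplotlib.pyplot as plt
from torchvision import datasets, transforms

def noise_image(x: torch.Tensor, alpha: float) -> torch.Tensor:
    """Additive Gaussian noising: z = x + alpha * eps, eps ~ N(0, I)."""
    return x + alpha * torch.randn_like(x)

x, _ = datasets.MNIST("data", train=False, download=True,
                      transform=transforms.ToTensor())[0]
alphas = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
fig, axes = plt.subplots(1, len(alphas), figsize=(12, 2))
for ax, a in zip(axes, alphas):
    ax.imshow(noise_image(x, a).squeeze().clamp(0, 1), cmap="gray")
    ax.set_title(f"alpha = {a}")
    ax.axis("off")
plt.show()
```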
I then visualized denoised results on the test set, showing sample outputs after the 1st and 5th epoch of training:
Visualization of the trained denoiser on test-set digits at varying noise levels α = 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0:
Time conditioning involves encoding the timestep t into a latent representation and injecting it into the UNet to inform the model about the stage of the denoising process, as introduced in the paper.
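One way to realize this, with the layer sizes and injection point chosen for illustration rather than taken from the paper, is a small fully connected block whose output is broadcast into an intermediate UNet activation:

```python
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Maps a scalar timestep (normalized to [0, 1]) to a conditioning vector."""
    def __init__(self, out_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.GELU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return self.net(t.view(-1, 1))

# Inside the UNet, the embedding can modulate a feature map, e.g.:
#   h = h + self.t_block(t)[:, :, None, None]  # broadcast over spatial dims
```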
I trained the time-conditioned UNet on the MNIST dataset. The training loss curve is shown below:
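A hedged sketch of one training step under the standard DDPM-style ε-prediction objective; the schedule (`T`, `betas`) and the normalized-timestep interface `model(xt, t / T)` are my assumptions about the setup, not details taken from the text above:

```python
import torch
import torch.nn.functional as F

T = 300                                         # number of diffusion steps (illustrative)
betas = torch.linspace(1e-4, 0.02, T)           # illustrative linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative products \bar{alpha}_t

def train_step(model, opt, x0):
    """One optimization step of the epsilon-prediction objective."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    ab = alphas_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps       # forward (noising) process
    loss = F.mse_loss(model(xt, t.float() / T), eps)  # predict the injected noise
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```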
I then sampled from the model trained above; samples after 5 epochs and after 20 epochs are shown below:
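For completeness, a sketch of the ancestral sampling loop under the same assumptions as the training step above (it reuses `T`, `betas`, and `alphas_bar`; the choice σ_t² = β_t for the added noise is one common convention):

```python
import torch

@torch.no_grad()
def sample(model, n: int, device: str = "cuda") -> torch.Tensor:
    """Ancestral sampling; reuses T, betas, alphas_bar from the training sketch."""
    x = torch.randn(n, 1, 28, 28, device=device)  # start from pure noise
    for t in reversed(range(T)):
        tb = torch.full((n,), t / T, device=device)
        eps_hat = model(x, tb)                    # predicted noise at step t
        a = 1.0 - betas[t].item()
        ab = alphas_bar[t].item()
        # posterior mean of x_{t-1} given the noise estimate
        x = (x - (1 - a) / (1 - ab) ** 0.5 * eps_hat) / a ** 0.5
        if t > 0:
            x += betas[t].item() ** 0.5 * torch.randn_like(x)
    return x
```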
Class conditioning allows the UNet to generate images conditioned on a specific class (e.g., digits 0–9 for MNIST). This is achieved by injecting a one-hot encoded class vector into the network. The training loss curve is shown below:
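A sketch of how the one-hot class vector can enter the network; note that the unconditional dropout (`p_uncond`) is a common companion technique I am assuming here, since it is what lets the same model also be sampled unconditionally, and it is not stated above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassBlock(nn.Module):
    """Maps a class label to a conditioning vector via a one-hot encoding."""
    def __init__(self, num_classes: int, out_dim: int, p_uncond: float = 0.1):
        super().__init__()
        self.num_classes = num_classes
        self.p_uncond = p_uncond
        self.net = nn.Sequential(
            nn.Linear(num_classes, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        c = F.one_hot(y, num_classes=self.num_classes).float()
        if self.training:
            # Assumption: zero the class vector for a fraction of samples so the
            # model also learns an unconditional mode (classifier-free guidance).
            keep = (torch.rand(y.shape[0], device=y.device) > self.p_uncond).float()
            c = c * keep[:, None]
        return self.net(c)

# e.g. inside the UNet: h = h * self.c_block(y)[:, :, None, None]
```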
Finally, I sampled from the class-conditioned model trained above, again after 5 epochs and after 20 epochs:
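If classifier-free guidance is used at sampling time (an assumption on my part, along with the guidance scale `gamma`), the class-conditioned loop differs from the time-conditioned one only in running the model twice per step and extrapolating between the two noise estimates:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_class(model, digit: int, n: int, gamma: float = 5.0,
                 device: str = "cuda") -> torch.Tensor:
    """Class-conditioned sampling with classifier-free guidance (assumed setup).
    Reuses T, betas, alphas_bar from the earlier sketches."""
    c = F.one_hot(torch.full((n,), digit, device=device), num_classes=10).float()
    x = torch.randn(n, 1, 28, 28, device=device)
    for t in reversed(range(T)):
        tb = torch.full((n,), t / T, device=device)
        eps_cond = model(x, tb, c)                      # class-conditioned estimate
        eps_uncond = model(x, tb, torch.zeros_like(c))  # unconditional estimate
        # Guidance: extrapolate past the unconditional estimate toward the class.
        eps_hat = eps_uncond + gamma * (eps_cond - eps_uncond)
        a = 1.0 - betas[t].item()
        ab = alphas_bar[t].item()
        x = (x - (1 - a) / (1 - ab) ** 0.5 * eps_hat) / a ** 0.5
        if t > 0:
            x += betas[t].item() ** 0.5 * torch.randn_like(x)
    return x
```

Under these assumptions, `sample_class(model, digit=3, n=10)` would produce ten 3s, with larger `gamma` trading sample diversity for class fidelity.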